Predicting Mental Health Problems from Social Determinants and Caregiving Activities of Caregivers of Persons with Dementia: A Machine Learning Approach
BMIN503/EPID600 Final Project
Author
Hannah Cho
Updated 12/3/24
0.1 Overview
After consulting with Drs. Demiris and Huang from the School of Nursing, we decided to explore the intersection of social determinants, caregiving activities, and mental health outcomes in caregivers of persons with dementia. Our goal is to understand how these factors contribute to caregivers’ mental health challenges, with a particular focus on identifying predictive patterns through a machine learning approach.
0.2 Introduction
The challenges faced by dementia caregivers are complex and multifaceted, encompassing not only the physical and emotional demands of caregiving but also social and systemic barriers. Caregiving is physically demanding: it can involve providing around-the-clock care, assisting with activities of daily living, and managing the health complications of dementia. It also places a significant burden on caregivers’ emotional well-being. Caregivers often experience stress, anxiety, depression, and a sense of isolation, as the demands of caregiving can leave little room for self-care or for personal and social engagement. Caregivers of persons with dementia are at particularly high risk of anxiety and depression because of the nature of the disease. The trajectory of dementia is not uniform and varies with individuals’ pre-existing conditions and multiple comorbidities. In addition to this uncertainty, caregivers often face significant social and systemic barriers that limit their access to essential support services. These barriers include financial strain, a lack of accessible respite care, insufficient knowledge about available resources, and cultural or social stigma associated with caregiving. Many caregivers also experience isolation due to a lack of social engagement.
Research Question: Which social strains and sociodemographic characteristics of caregivers most strongly predict anxiety and depression among caregivers of persons living with dementia, and how accurately can supervised machine learning models predict these outcomes?
0.3 Methods
Dataset: This study uses data from the National Health and Aging Trends Study (NHATS) Round 11 and the National Study of Caregiving (NSOC) Round 4, which include data collected in 2021. The NHATS is a publicly accessible dataset that includes a nationally representative sample of Medicare beneficiaries aged 65 years and older in the United States. The NSOC is conducted alongside the NHATS; participants in the NSOC are caregivers for older adults included in the NHATS. Both the NHATS and the NSOC were funded by the National Institute on Aging (R01AG062477; U01AG032947). When used together, the NHATS and NSOC provide valuable information on dyads of older adults receiving care and their family caregivers.
Samples: Persons with dementia: Probable dementia was identified based on one of the following criteria: a report (by the participant or a proxy) of a physician’s diagnosis of dementia or Alzheimer’s disease, a score of 2 or higher on the AD8 screening instrument administered to proxy respondents, or scores at least 1.5 standard deviations below the mean on a range of cognitive tests.
Caregivers: Caregivers were identified from the NSOC and NHATS datasets. Since this project specifically aims to explore caregivers of persons with dementia in the community, the sample was further filtered by dementia classification (demclas) and residency (r11dresid).
After retrieving NHATS Round 11 and NSOC Round 4, I derived the dementia classification from the NHATS Round 11 file (r11demclas), selected the sample on that basis, and then merged the necessary datasets.
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(haven)   # read_dta() for Stata files
library(ggplot2) # for data visualization

# Bring in datasets
df1 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NHATS_Round_11_SP_File_V2.dta") # dementia classification in this file
df2 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NSOC_r11.dta") # caregiver information 1
df3 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NSOC_cross.dta") # caregiver information 2
df4 <- read_dta("~/R HC/BMIN503_Final_Project/final final/NHATS_Round_11_OP_File.dta") # older adults information
# Need to clean df1 first in order to assign dementia classes
# ENTER WHICH ROUND?
sp1 <- df1 |>
  mutate(rnd = 11) # 3. EDIT ROUND NUMBER HERE

# (THIS REMOVES THE PREFIXES ON NEEDED VARIABLES)
sp1 <- sp1 |>
  rename_all(~stringr::str_replace(., "^r11", "")) |>
  rename_all(~stringr::str_replace(., "^hc11", "")) |>
  rename_all(~stringr::str_replace(., "^is11", "")) |>
  rename_all(~stringr::str_replace(., "^cp11", "")) |>
  rename_all(~stringr::str_replace(., "^cg11", ""))

# ADD R1DAD8DEM AND SET TO -1 FOR ROUND 1 BECAUSE THERE IS NO PRIOR DIAGNOSIS IN R1
sp1 <- sp1 |>
  mutate(dad8dem = ifelse(rnd == 1, -1, dad8dem))

# SUBSET NEEDED VARIABLES
df <- sp1 |>
  dplyr::select(spid, rnd, dresid, resptype, disescn9,
                chgthink1, chgthink2, chgthink3, chgthink4,
                chgthink5, chgthink6, chgthink7, chgthink8,
                dad8dem, speaktosp,
                todaydat1, todaydat2, todaydat3, todaydat4, todaydat5,
                presidna1, presidna3, vpname1, vpname3,
                quesremem, dclkdraw, atdrwclck,
                dwrdimmrc, dwrdlstnm, dwrddlyrc)

# FIX A ROUND 2 CODING ERROR (left commented out)
# df <- df |> mutate(dwrdimmrc = ifelse(dwrdimmrc == 10 & dwrddlyrc == -3 & rnd == 2, -3, dwrdimmrc))

# CREATE SELECTED ROUND DEMENTIA CLASSIFICATION VARIABLE
df <- df |>
  mutate(demclas =
    # SET MISSING (RESIDENTIAL CARE FQ ONLY) AND N.A. (NURSING HOME RESIDENTS, DECEASED)
    ifelse(dresid == 3 | dresid == 5 | dresid == 7, -9,
    ifelse((dresid == 4 & rnd == 1) | dresid == 6 | dresid == 8, -1,
    # CODE PROBABLE IF DEMENTIA DIAGNOSIS REPORTED BY SELF OR PROXY
    ifelse((disescn9 == 1 | disescn9 == 7) & (resptype == 1 | resptype == 2), 1, NA))))

# CODE AD8_SCORE
# ASSIGN VALUES TO AD8 ITEMS IF PROXY AND DEMENTIA CLASS NOT ALREADY ASSIGNED BY REPORTED DIAGNOSIS
for (i in 1:8) {
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(
    # PROXY REPORTS NO CHANGE
    ifelse(df[[paste("chgthink", i, sep = "")]] == 2 & df$resptype == 2 & is.na(df$demclas), 0,
    # PROXY REPORTS A CHANGE OR ALZ/DEMENTIA
    ifelse((df[[paste("chgthink", i, sep = "")]] == 1 | df[[paste("chgthink", i, sep = "")]] == 3) &
             df$resptype == 2 & is.na(df$demclas), 1,
    # SET TO NA IF IN RES CARE AND demclas=., OTHERWISE AD8 ITEM IS SET TO NOT APPLICABLE
    ifelse(df$resptype == 2 & is.na(df$demclas), NA, -1))))
}

# INITIALIZE COUNTS TO NOT APPLICABLE
for (i in 1:8) {
  df[[paste("ad8miss_", i, sep = "")]] <- as.numeric(
    ifelse(is.na(df[[paste("ad8_", i, sep = "")]]), 1,
    ifelse((df[[paste("ad8_", i, sep = "")]] == 0 | df[[paste("ad8_", i, sep = "")]] == 1) &
             df$resptype == 2 & is.na(df$demclas), 0, -1)))
}

for (i in 1:8) {
  df[[paste("ad8_", i, sep = "")]] <- as.numeric(
    ifelse(is.na(df[[paste("ad8_", i, sep = "")]]) & is.na(df$demclas) & df$resptype == 2, 0,
           df[[paste("ad8_", i, sep = "")]]))
}

# COUNT AD8 ITEMS
# ROUNDS 2+
df <- df |>
  mutate(ad8_score = ifelse(resptype == 2 & is.na(demclas),
                            (ad8_1 + ad8_2 + ad8_3 + ad8_4 + ad8_5 + ad8_6 + ad8_7 + ad8_8), -1)) |>
  # SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8
  mutate(ad8_score = ifelse(dad8dem == 1 & resptype == 2 & is.na(demclas), 8, ad8_score)) |>
  # SET PREVIOUS ROUND DEMENTIA DIAGNOSIS BASED ON AD8 TO AD8_SCORE=8 FOR ROUNDS 4-9
  mutate(ad8_score = ifelse(resptype == 2 & dad8dem == -1 & chgthink1 == -1 &
                              (rnd >= 4 & rnd <= 9) & is.na(demclas), 8, ad8_score))

# COUNT MISSING AD8 ITEMS
df <- df |>
  mutate(ad8_miss = ifelse(resptype == 2 & is.na(demclas),
                           (ad8miss_1 + ad8miss_2 + ad8miss_3 + ad8miss_4 +
                              ad8miss_5 + ad8miss_6 + ad8miss_7 + ad8miss_8), -1))

# CODE AD8 DEMENTIA CLASS
# IF SCORE>=2 THEN MEETS AD8 CRITERIA
# IF SCORE IS 0 OR 1 THEN DOES NOT MEET AD8 CRITERIA
df <- df |>
  mutate(ad8_dem = ifelse(ad8_score >= 2, 1,
                   ifelse(ad8_score == 0 | ad8_score == 1 | ad8_miss == 8, 2, NA)))

# UPDATE DEMENTIA CLASSIFICATION VARIABLE WITH AD8 CLASS
df <- df |>
  # PROBABLE DEMENTIA BASED ON AD8 SCORE
  mutate(demclas = ifelse(ad8_dem == 1 & is.na(demclas), 1,
    # NO DIAGNOSIS, DOES NOT MEET AD8 CRITERION, AND PROXY SAYS CANNOT ASK SP COGNITIVE ITEMS
    ifelse(ad8_dem == 2 & speaktosp == 2 & is.na(demclas), 3, demclas)))

# CODE DATE ITEMS AND COUNT
# CODE ONLY YES/NO RESPONSES: MISSING/NA CODES -1, -9 LEFT MISSING
# 2: NO/DK OR -7: REFUSED RECODED TO 0: NO/DK/RF
# ADD NOTES HERE ABOUT WHAT IS HAPPENING IN ROUNDS 1-3, 5+ VS. ROUND 4
for (i in 1:5) {
  df[[paste("date_item", i, sep = "")]] <- as.numeric(
    ifelse(df[[paste("todaydat", i, sep = "")]] == 1, 1,
    ifelse(df[[paste("todaydat", i, sep = "")]] == 2 | df[[paste("todaydat", i, sep = "")]] == -7, 0, NA)))
}

# COUNT CORRECT DATE ITEMS
df <- df |>
  mutate(date_item4 = ifelse(rnd == 4, date_item5, date_item4)) |>
  mutate(date_sum = date_item1 + date_item2 + date_item3 + date_item4) |>
  # PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sum = ifelse(speaktosp == 2 & is.na(date_sum), -2,
    # PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER
    ifelse((is.na(date_item1) | is.na(date_item2) | is.na(date_item3) | is.na(date_item4)) &
             speaktosp == 1, -3, date_sum))) |>
  # MISSING IF PROXY SAYS CAN'T SPEAK TO SP
  mutate(date_sumr = ifelse(date_sum == -2, NA,
    # 0 IF SP UNABLE TO ANSWER
    ifelse(date_sum == -3, 0, date_sum)))

# PRESIDENT AND VICE PRESIDENT NAME ITEMS AND COUNT
# CODE ONLY YES/NO RESPONSES: MISSING/N.A. CODES -1, -9 LEFT MISSING
# 2: NO/DK OR -7: REFUSED RECODED TO 0: NO/DK/RF
df <- df |>
  mutate(preslast = ifelse(presidna1 == 1, 1, ifelse(presidna1 == 2 | presidna1 == -7, 0, NA))) |>
  mutate(presfirst = ifelse(presidna3 == 1, 1, ifelse(presidna3 == 2 | presidna3 == -7, 0, NA))) |>
  mutate(vplast = ifelse(vpname1 == 1, 1, ifelse(vpname1 == 2 | vpname1 == -7, 0, NA))) |>
  mutate(vpfirst = ifelse(vpname3 == 1, 1, ifelse(vpname3 == 2 | vpname3 == -7, 0, NA))) |>
  # COUNT CORRECT PRESIDENT/VP NAME ITEMS
  mutate(presvp = preslast + presfirst + vplast + vpfirst) |>
  # PROXY SAYS CAN'T SPEAK TO SP
  mutate(presvp = ifelse(speaktosp == 2 & is.na(presvp), -2,
    # PROXY SAYS CAN SPEAK TO SP BUT SP UNABLE TO ANSWER
    ifelse((is.na(preslast) | is.na(presfirst) | is.na(vplast) | is.na(vpfirst)) &
             speaktosp == 1 & is.na(presvp), -3, presvp))) |>
  # MISSING IF PROXY SAYS CAN'T SPEAK TO SP
  mutate(presvpr = ifelse(presvp == -2, NA, ifelse(presvp == -3, 0, presvp))) |>
  # ORIENTATION DOMAIN: SUM OF DATE RECALL AND PRESIDENT/VP NAMING
  mutate(date_prvp = date_sumr + presvpr)

# EXECUTIVE FUNCTION DOMAIN: CLOCK DRAWING SCORE
# RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUND 10 ONLY)
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd == 10, -2,
    ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) &
             dclkdraw == -9 & rnd == 10, -3,
    ifelse(atdrwclck == 2 & dclkdraw == -9 & rnd == 10, -4,
    ifelse(atdrwclck == 97 & dclkdraw == -9 & rnd == 10, -7, dclkdraw)))))

# RECODE DCLKDRAW TO ALIGN WITH MISSING VALUES IN PREVIOUS ROUNDS (ROUNDS 11 AND FORWARD ONLY)
df <- df |>
  mutate(dclkdraw = ifelse(speaktosp == 2 & dclkdraw == -9 & rnd >= 11, -2,
    ifelse(speaktosp == 1 & (quesremem == 2 | quesremem == -7 | quesremem == -8) &
             dclkdraw == -9 & rnd >= 11, -3, dclkdraw)))

df <- df |>
  mutate(clock_scorer = ifelse(dclkdraw == -3 | dclkdraw == -4 | dclkdraw == -7, 0,
    # IMPUTE MEAN SCORE TO PERSONS MISSING A CLOCK
    # IF PROXY SAID CAN ASK SP
    ifelse(dclkdraw == -9 & speaktosp == 1, 2,
    # IF SELF-RESPONDENT
    ifelse(dclkdraw == -9 & speaktosp == -1, 3,
    ifelse(dclkdraw == -2 | dclkdraw == -9, NA, dclkdraw)))))

# MEMORY DOMAIN: IMMEDIATE AND DELAYED WORD RECALL
df <- df |>
  mutate(irecall = ifelse(dwrdimmrc == -2 | dwrdimmrc == -1, NA,
                   ifelse(dwrdimmrc == -7 | dwrdimmrc == -3, 0, dwrdimmrc))) |>
  # round 5 only: set cases with missing word list and not previously assigned to missing
  mutate(irecall = ifelse(rnd == 5 & dwrddlyrc == -9, NA, irecall)) |>
  mutate(drecall = ifelse(dwrddlyrc == -2 | dwrddlyrc == -1, NA,
                   ifelse(dwrddlyrc == -7 | dwrddlyrc == -3, 0, dwrddlyrc))) |>
  # round 5 only: set cases with missing word list and not previously assigned to missing
  mutate(drecall = ifelse(rnd == 5 & dwrddlyrc == -9, NA, drecall)) |>
  mutate(wordrecall0_20 = irecall + drecall)

# CREATE COGNITIVE DOMAINS FOR ALL ELIGIBLE
df <- df |>
  mutate(clock65 = ifelse(clock_scorer == 0 | clock_scorer == 1, 1,
                   ifelse(clock_scorer > 1 & clock_scorer < 6, 0, NA)))
df <- df |>
  mutate(word65 = ifelse(wordrecall0_20 >= 0 & wordrecall0_20 <= 3, 1,
                  ifelse(wordrecall0_20 > 3 & wordrecall0_20 <= 20, 0, NA)))
df <- df |>
  mutate(datena65 = ifelse(date_prvp >= 0 & date_prvp <= 3, 1,
                    ifelse(date_prvp > 3 & date_prvp <= 8, 0, NA)))

# CREATE COGNITIVE DOMAIN SCORE
df <- df |>
  mutate(domain65 = clock65 + word65 + datena65)

# SET CASES WITH MISSING WORD LIST AND NOT PREVIOUSLY ASSIGNED TO MISSING (ROUND 5 ONLY)
df <- df |>
  mutate(demclas = ifelse(rnd == 5 & dwrdlstnm == -9 & is.na(demclas), -9, demclas))

# UPDATE COGNITIVE CLASSIFICATION
df <- df |>
  # PROBABLE DEMENTIA
  mutate(demclas = ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) &
                            (domain65 == 2 | domain65 == 3), 1,
    # POSSIBLE DEMENTIA
    ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65 == 1, 2,
    # NO DEMENTIA
    ifelse(is.na(demclas) & (speaktosp == 1 | speaktosp == -1) & domain65 == 0, 3, demclas))))

# KEEP VARIABLES AND SAVE DATA
df <- df |>
  dplyr::select(spid, rnd, demclas)

# CHANGE # AFTER "r" TO THE ROUND OF INTEREST
r11demclas <- df

# 4. NAME AND SAVE DEMENTIA DATA FILE
# CHANGE # AFTER "r" TO THE ROUND OF INTEREST
# Note: save() writes R's own binary format regardless of the .dta extension
save(r11demclas, file = "~/R HC/BMIN503_Final_Project/final final/NHATS_r11.dta")
Merging the data set
# Merged datasets (md)
md1 <- left_join(df, df1, by = "spid")
md2 <- left_join(md1, df3, by = "spid")

# Choose persons with probable (1) or possible (2) dementia who live at home
dementia1 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid %in% c("1")))
dementia2 <- md2 |>
  filter(demclas %in% c("1", "2") & (r11dresid %in% c("1", "2")))
Predictors: Caregiver-level factors were caregivers’ age, race, gender, self-reported income, and highest education level. These variables were recoded for analysis. The education level of the caregivers was categorized as “Less than high school (0)”, “High School (1)”, and “College or above (2).” For economic status, the caregivers’ reported income from the previous year was used. This study included both informal and formal support as part of the caregivers’ social determinants of health. Informal support included having friends or family (a) to talk to about important life matters, (b) to help with daily activities, such as running errands, and (c) to assist with care provision [10]. Formal support included (a) participation in a support group for caregivers, (b) access to respite services that allowed the caregiver to take time off, and (c) involvement in a training program that assisted the caregiver in providing care for the care recipient [10]. We used these individual items as support questions; each was answered by indicating whether or not the caregiver received that support.
# Caregiver's age (renaming variable)
# chd11dage

# Race
# Recode crl11dcgracehisp into a new categorical variable, race_recode,
# with "White, non-Hispanic" (0) as the reference category
dementia1 <- dementia1 |>
  mutate(race_recode = case_when(
    crl11dcgracehisp == 1 ~ 0, # White, non-Hispanic
    crl11dcgracehisp == 2 ~ 1, # Black, non-Hispanic
    crl11dcgracehisp == 3 ~ 2, # Others
    crl11dcgracehisp == 4 ~ 3, # Hispanic
    crl11dcgracehisp %in% c(5, 6) ~ NA_real_, # Missing or not applicable
    TRUE ~ NA_real_ # Unhandled cases
  ))
table(dementia1$crl11dcgracehisp)
1 2 3 4 5 6
250 193 16 46 1 22
table(dementia1$race_recode)
0 1 2 3
250 193 16 46
# Gender: Male as reference (0), Female as 1
dementia1 <- dementia1 |>
  mutate(gender_recode = case_when(
    as.character(c11gender) == "1" ~ 0, # Male
    as.character(c11gender) == "2" ~ 1, # Female
    TRUE ~ NA_real_ # Handle any other unexpected cases
  ))

# Education: recode education levels into grouped categories
table(dementia1$chd11educ)
dementia1 <- dementia1 |>
  mutate(edu_recode = case_when(
    chd11educ %in% c(1, 2, 3) ~ 1, # High school diploma or below
    chd11educ %in% c(4, 5) ~ 1, # Some college
    chd11educ %in% c(6, 7, 8) ~ 2, # College and beyond
    chd11educ == 9 ~ 3,
    chd11educ %in% c(-8, -7, -6) ~ NA_real_, # Missing or not applicable
    TRUE ~ NA_real_ # Unhandled cases
  ))
table(dementia1$edu_recode)
1 2 3
209 234 69
# Marital status: recode into binary (married vs. not married)
dementia1 <- dementia1 |>
  mutate(martstat_recode = case_when(
    chd11martstat == 1 ~ 0, # Married
    chd11martstat %in% 2:6 ~ 1, # Not married (single, divorced, etc.)
    chd11martstat %in% c(-8, -6) ~ NA_real_, # Missing or not applicable
    TRUE ~ NA_real_ # Unhandled cases
  ))
table(dementia1$martstat_recode)
0 1
204 232
Caregivers’ caregiving activities
# Recoding: 1 (every day), 2 (most days), 3 (some days) -> 1; 4 (rarely), 5 (never) -> 0
# C11 CA1 HOW OFT HELP WITH CHORES (cca11hwoftchs)
# C11 CA2 HOW OFTEN SHOPPED FOR SP (cca11hwoftshp)
# C11 CA6 HOW OFT HELP PERS CARE (cca11hwoftpc)
# C11 CA6B1 HELP CARE FOR TEETH (PREVIOUSLY CA11F)
# C11 CA7 HOW OFT HLP GTNG ARD HOME (cca11hwofthom)
# C11 CA9 HOW OFTEN DROVE SP (cca11hwoftdrv)
# C11 CA10 OFTN WENT ON OTH TRANSPR (cca11hwoftott)

# Caregiver's caregiving activities
cca11hwoftchs # C11 CA1 HOW OFT HELP WITH CHORES
cca11hwoftshp # C11 CA2 HOW OFTEN SHOPPED FOR SP
cca11hwoftpc  # C11 CA6 HOW OFT HELP PERS CARE
cca11hwofthom # C11 CA7 HOW OFT HLP GTNG ARD HOME
cca11hwoftdrv # C11 CA9 HOW OFTEN DROVE SP
cca11hwoftott # C11 CA10 OFTN WENT ON OTH TRANSPR
1 caregiver’s features
che11enrgylmt # Energy often limited
cac11diffphy  # Caregiver physical difficulty helping
cac11exhaustd # Caregiver exhausted at night
cac11toomuch  # Care more than can handle
cac11uroutchg # Care routine then changes
cac11notime   # No time for self (-8, -7, -6)
cac11diffemlv # Caregiver emotional difficulty (-1, 1-5)
cpp11hlpkptgo # Kept from going out (-6-1, 1, 2)
che11health   # General health (-8, 1-5)
che11sleepint # Interrupted sleep (-8, -7, -6, 1-5)
op11numhrsday # Number of hours per day help (-7, -1, 1-6)
op11numdaysmn # Number of days help per month (-7, -1, 1-6)

# Sociodemographic features (persons living with dementia and caregiver)
table(dementia1$cac11notime)

op11leveledu  # Caregiver education (NA -1, 1-5)
cac11diffinc  # Caregiver financial difficulties (-8, -7, -6, binary 1, 2)
ew11progneed1 # Persons living with dementia received food stamps (-8, -7, binary)
ew11finhlpfam # Persons living with dementia financial help from family (-8, -7, binary)
mc11havregdoc # Persons living with dementia have a regular doctor (1, 2 binary)
hc11hosptstay # Persons living with dementia hospital stay in last 12 months (1, 2)
hc11hosovrnht # Persons living with dementia number of hospital stays (-7, -1, 1-6 times)
Outcomes: Caregivers’ anxiety and depressive symptoms were each measured with two questions. Anxiety was measured with the Generalized Anxiety Disorder-2 (GAD-2) scale, which consists of two questions. Since the NSOC includes the GAD-2 items, this study used them to measure anxiety among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6. Higher scores correspond to greater anxiety, with a total GAD-2 score of 3 or more indicating anxiety.
Caregivers’ depression was evaluated using the Patient Health Questionnaire-2 (PHQ-2) scale. Given that the NSOC includes the PHQ-2 items, this study used them to measure depression among caregivers. Each item on the scale is rated on a four-point Likert scale, ranging from 0 (not at all) to 3 (nearly every day), resulting in a total score between 0 and 6, with higher scores indicating more severe depression. The developers of the PHQ-2 identified a score of 3 as the optimal cutpoint when screening for depression: if the score is 3 or greater, major depressive disorder is likely.
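The scoring rule for both screeners can be sketched in a few lines of base R. `score_two_item()` is a hypothetical helper, not part of the project code, and its `item_min` argument is an assumption: the cutoff of 3 only applies when each item is on the 0 to 3 scale, so items stored with a different lowest code (for example 1 to 4) would need to be shifted first. How the survey items are actually coded should be checked against the codebook.

```r
# Hypothetical helper: score a two-item screener (GAD-2 or PHQ-2).
# item_min is the lowest valid response code in the raw data (assumed);
# subtracting it puts each item on the 0-3 scale that the cutoff expects.
score_two_item <- function(item1, item2, item_min = 0, cutoff = 3) {
  # Treat survey missing codes (negative values) as NA before scoring
  item1[!is.na(item1) & item1 < 0] <- NA
  item2[!is.na(item2) & item2 < 0] <- NA
  total <- (item1 - item_min) + (item2 - item_min)
  data.frame(total = total, positive = as.integer(total >= cutoff))
}

# Toy responses already on the 0-3 scale; -8 stands in for a missing code
toy <- score_two_item(item1 = c(0, 1, 3, 2, -8), item2 = c(0, 2, 3, 0, 1))
toy
```

A respondent is flagged only when both items are observed and their sum reaches the cutoff; a missing item leaves the total and the flag as NA rather than silently counting as zero.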
# Sum the two questions for GAD-2
dementia1$total_gad2 <- dementia1$che11fltnervs + dementia1$che11fltworry
# Recode the combined variable using a cut-off of 3
dementia1$gad2_cg_cat <- ifelse(dementia1$total_gad2 < 3, 0, 1)
table(dementia1$gad2_cg_cat)
0 1
279 249
summary(dementia1$gad2_cg_cat) #1 ~ anxiety
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.4716 1.0000 1.0000 251
# Sum of the two questions for PHQ-2 (che11fltltlin + che11fltdown)
dementia1$total_phq2 <- dementia1$che11fltltlin + dementia1$che11fltdown
# Recode the combined variable using a cut-off of 3
dementia1$phq2_cg_cat <- ifelse(dementia1$total_phq2 < 3, 0, 1)
table(dementia1$phq2_cg_cat)
0 1
276 252
summary(dementia1$phq2_cg_cat) #1 ~ depression
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0000 0.0000 0.0000 0.4773 1.0000 1.0000 251
Data analysis
For data analysis, we first conducted descriptive analyses, including means, standard deviations, ranges, and percentages, to summarize the dataset. To investigate how caregivers’ social strains and caregiver-level factors influence caregiver anxiety and depression, we performed logistic regression analyses. Guided by the conceptual framework of this study, univariate logistic regression analyses were used to identify caregivers’ social strains and caregiver-level factors significantly associated with caregiver anxiety and depression, controlling for care recipient-level factors. Variables with a p-value below 0.05 in the univariate analyses were included in the subsequent multivariate logistic regression model, which was used to determine which factors most strongly influenced caregiver anxiety and depression. All statistical analyses were conducted using R, with statistical significance set at a p-value of less than 0.05.
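The two-step screening strategy described here can be illustrated with base R's glm(). This is a minimal sketch on simulated data, not the project's analysis: the variable names (`exhausted`, `poor_health`, `noise`, `anxious`), the effect sizes, and the seed are all made up for illustration, and adjustment for care recipient-level factors is omitted.

```r
set.seed(503)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data; variable names are illustrative only
n <- 400
sim <- data.frame(
  exhausted   = rbinom(n, 1, 0.4),
  poor_health = rbinom(n, 1, 0.3),
  noise       = rnorm(n)
)
# The outcome depends on the first two predictors but not on `noise`
lin <- -1 + 1.2 * sim$exhausted + 1.0 * sim$poor_health
sim$anxious <- rbinom(n, 1, plogis(lin))

predictors <- c("exhausted", "poor_health", "noise")

# Step 1: one logistic regression per predictor; keep its Wald p-value
uni_p <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "anxious"), data = sim, family = binomial)
  coef(summary(fit))[v, "Pr(>|z|)"]
})

# Step 2: refit with only the predictors that pass the 0.05 screen
keep <- names(uni_p)[uni_p < 0.05]
multi_fit <- glm(reformulate(keep, response = "anxious"), data = sim, family = binomial)

# Odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(multi_fit), confint.default(multi_fit)))
round(or_table, 2)
```

Screening on univariate p-values is simple to explain, but it can drop predictors whose effect only appears after adjustment, so the retained set is best read as a starting point.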
# Install and load pheatmap if necessary
library(pheatmap)
# Create the heatmap
pheatmap(cor_matrix, color = colorRampPalette(c("blue", "white", "red"))(50))
Since my outcomes are GAD-2 and PHQ-2, I am most focused on the variables surrounding the square: che11enrgylmt, che11health, cac11diffemlv, hc11hosovrnht, martstat_recode, race_recode, mc11havregdoc, gender_recode, cca11hwoftott, ia11totinc, ew11finhlpfam, and ew11progneed1.
Pearson’s correlation matrices were presented as heatmaps for both rounds (5 and 7) to visually assess the data and evaluate the independence of variables (Supplementary Figure 2a and b). The Pearson’s correlation coefficient quantifies the linear relationship between two continuous variables, with values ranging from −1 to +1. A coefficient of 0 indicates no linear correlation, while negative and positive values represent negative and positive correlations, respectively. A p-value of < .05 was used to define statistical significance. Then I selected the most representative variable to reduce redundancy in concepts among highly correlated items. The negative values in the dataset were not addressed in this step.
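As a rough illustration of how such a matrix can be built and screened for redundant items, the sketch below uses base R's cor() on simulated data. The object name `cor_matrix` mirrors the one passed to pheatmap(), but the data, the item names, and the 0.7 threshold are assumptions made for this example only.

```r
set.seed(11)  # arbitrary seed for the simulated example

# Simulated stand-in for three survey items; names are illustrative only
n <- 200
a <- rnorm(n)
toy <- data.frame(
  item_a = a,
  item_b = a + rnorm(n, sd = 0.3),  # built to correlate strongly with item_a
  item_c = rnorm(n)                 # independent of the other two
)
toy$item_c[1:10] <- NA              # some missing responses

# Pearson correlations using all pairwise-complete observations
cor_matrix <- cor(toy, method = "pearson", use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds a chosen threshold
threshold <- 0.7  # assumed cut-off, not taken from the analysis
idx <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = cor_matrix[idx]
)
high_pairs

# Significance test for a single pair
cor.test(toy$item_a, toy$item_b)$p.value
```

Restricting the search to the upper triangle reports each pair once and skips the diagonal, which makes it easier to pick one representative variable from each highly correlated pair.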
1.1.1 Feature selection
The original dataset includes numerous negative and N/A values, and the sample size is small, necessitating preprocessing before feature selection. To handle the small sample size, negative values were transformed into N/A values. These N/A values were then imputed based on the type of feature. For continuous variables, N/A values were replaced with the median of the respective column, while the most frequent category level was used to impute N/A values for categorical features.
# Convert negative values to NA
dementia_subset[dementia_subset < 0] <- NA

# Imputation
# Continuous: replace NA with the column median
continuous_vars <- sapply(dementia_subset, is.numeric)
dementia_subset[continuous_vars] <- lapply(
  dementia_subset[continuous_vars],
  function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)
)

# Categorical: replace NA with the most frequent level
categorical_vars <- sapply(dementia_subset, is.factor)
dementia_subset[categorical_vars] <- lapply(
  dementia_subset[categorical_vars],
  function(x) {
    most_frequent <- names(sort(table(x), decreasing = TRUE))[1]
    x[is.na(x)] <- most_frequent # keeps x as a factor with its original levels
    x
  }
)

# Final cleaning for dataset
# Choose only one caregiver for each participant
final <- dementia_subset |>
  group_by(spid) |>
  slice_head(n = 1) |>
  ungroup()
# total 563 caregivers

# Create a new dataset for each outcome
anxiety <- subset(
  final,
  select = c(gad2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott,
             che11enrgylmt, cac11diffphy, cac11exhaustd, cac11diffemlv,
             che11health, che11sleepint, ew11progneed1, race_recode,
             gender_recode, edu_recode, chd11dage, martstat_recode)
)
depression <- subset(
  final,
  select = c(phq2_cg_cat, cca11hwoftchs, cca11hwoftshp, cca11hwoftott,
             che11enrgylmt, cac11diffphy, cac11exhaustd, cac11diffemlv,
             che11health, che11sleepint, ew11progneed1, race_recode,
             gender_recode, edu_recode, chd11dage, martstat_recode)
)
1.2 Results
This exploratory study employs several analytic techniques, including correlation matrix analysis, logistic regression (glm), and random forest (RF), to identify key predictors of caregiver depression and anxiety. This multipronged approach suits the mix of data types in this study. Machine learning methods, being inductive, support hypothesis generation and allow for systematic feature reduction by excluding variables deemed unimportant across multiple methods. Refining the feature set in this way is intended to improve both the interpretability and the predictive accuracy of the models.
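Predictive accuracy is only meaningful when it is measured on rows the model did not see during fitting. The base R sketch below illustrates that idea with a simple hold-out split and a hand-written AUC; the simulated data, the 75/25 split, the seed, and the helper name `auc()` are assumptions for illustration and do not reproduce the tidymodels pipeline used in this project.

```r
set.seed(600)  # arbitrary seed for reproducibility

# Simulated stand-in data; names are illustrative only
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x1))

# Hold out 25% of rows as a test set (the split proportion is an assumption)
test_idx <- sample(n, size = round(0.25 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Fit on the training rows only
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predicted probabilities for rows the model has not seen
p_test <- predict(fit, newdata = test, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(truth, prob) {
  r  <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(test$y, p_test)
```

Fitting and evaluating on the same rows would give an optimistic AUC, which is why the model here never sees `test` until prediction time.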
# Logistic regression workflow for anxiety
anxiety_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(gad2_cg_cat ~ .)

# Fit the workflow to the training data
anxiety_glm_fit <- anxiety_glm_wf |>
  fit(data = anxiety_train)

# Generate predictions with probabilities on the test data
anxiety_glm_predicted <- predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob")
anxiety_glm_predicted

# Combine into a single data frame
anxiety_glm_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat, # Actual values of the outcome variable
  predict(anxiety_glm_fit, new_data = anxiety_test), # Predicted class labels
  predict(anxiety_glm_fit, new_data = anxiety_test, type = "prob") # Predicted probabilities
)
print(anxiety_glm_pred_values)

# Prediction on the test data
anxiety.lr.pred.values.test <- bind_cols(
  truth = anxiety_test$gad2_cg_cat,
  predict(lr_class_fit, anxiety_test),
  predict(lr_class_fit, anxiety_test, type = "prob")
)
anxiety.lr.pred.values.test
# Logistic regression workflow for depression
depression_glm_wf <- workflow() |>
  add_model(lr_class_spec) |>
  add_formula(phq2_cg_cat ~ .)

# Fit the workflow to the training data
depression_glm_fit <- depression_glm_wf |>
  fit(data = depression_train)

# Generate predictions with probabilities on the test data
depression_glm_predicted <- predict(depression_glm_fit, new_data = depression_test, type = "prob")
depression_glm_predicted

# Combine into a single data frame
depression_glm_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat, # Actual values of the outcome variable
  predict(depression_glm_fit, new_data = depression_test), # Predicted class labels
  predict(depression_glm_fit, new_data = depression_test, type = "prob") # Predicted probabilities
)
print(depression_glm_pred_values)

# Prediction on the test data
depression.lr.pred.values.test <- bind_cols(
  truth = depression_test$phq2_cg_cat,
  predict(lr_class_fit, depression_test),
  predict(lr_class_fit, depression_test, type = "prob")
)
depression.lr.pred.values.test

# Fit the random forest model on the full training data
anxiety_rf_fit <- rf_spec |>
  fit(gad2_cg_cat ~ ., data = anxiety_train)
anxiety_rf_fit
parsnip model object
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 17.56%
Confusion matrix:
0 1 class.error
0 293 33 0.1012270
1 46 78 0.3709677
# Testing
anxiety_rf_pred_values <- bind_cols(
  truth = anxiety_test$gad2_cg_cat, # Actual values of the outcome variable
  predict(anxiety_rf_fit, new_data = anxiety_test), # Predicted class labels
  predict(anxiety_rf_fit, new_data = anxiety_test, type = "prob") # Predicted probabilities
)
roc_auc(anxiety_rf_pred_values, truth, .pred_0)

# Collect metrics from the resampling results
rf_wf_fit_cv_anxiety_metrics <- collect_metrics(rf_wf_fit_cv_anxiety)
# roc_auc = 0.9213547

# If you need predictions for further analysis
rf_wf_fit_cv_anxiety_preds <- collect_predictions(rf_wf_fit_cv_anxiety)
Depression
rf_spec <- rand_forest(trees = 1000, min_n = 5) |>
  set_engine("randomForest", importance = TRUE) |>
  set_mode("classification")

depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)

## Top variables
depression_rf_fit |>
  extract_fit_engine() |>
  vip()

# Fit the random forest model on the full training data
depression_rf_fit <- rf_spec |>
  fit(phq2_cg_cat ~ ., data = depression_train)
depression_rf_fit
parsnip model object
Call:
randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, nodesize = min_rows(~5, x), importance = ~TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 18.22%
Confusion matrix:
0 1 class.error
0 303 30 0.09009009
1 52 65 0.44444444
# Testing
depression_rf_pred_values <- bind_cols(
  truth = depression_test$phq2_cg_cat, # Actual values of the outcome variable
  predict(depression_rf_fit, new_data = depression_test), # Predicted class labels
  predict(depression_rf_fit, new_data = depression_test, type = "prob") # Predicted probabilities
)
roc_auc(depression_rf_pred_values, truth, .pred_0)
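The out-of-bag confusion matrices printed for the two random forests can be turned into error rate, sensitivity, and specificity with a little arithmetic. The sketch below re-enters the printed counts by hand; `confusion_metrics()` is a hypothetical helper written for this example, and it assumes the randomForest layout in which rows are observed classes and columns are predicted classes.

```r
# Hypothetical helper: summary metrics from a 2x2 confusion matrix laid out
# as in the randomForest printout (rows = observed, columns = predicted)
confusion_metrics <- function(cm) {
  c(
    error_rate  = (cm["0", "1"] + cm["1", "0"]) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),  # class 1 correctly identified
    specificity = cm["0", "0"] / sum(cm["0", ])   # class 0 correctly identified
  )
}

# Out-of-bag counts copied from the two model printouts
labels <- list(observed = c("0", "1"), predicted = c("0", "1"))
anxiety_cm    <- matrix(c(293, 33, 46, 78), nrow = 2, byrow = TRUE, dimnames = labels)
depression_cm <- matrix(c(303, 30, 52, 65), nrow = 2, byrow = TRUE, dimnames = labels)

metrics <- rbind(
  anxiety    = confusion_metrics(anxiety_cm),
  depression = confusion_metrics(depression_cm)
)
round(metrics, 4)
```

The error rates reproduce the printed OOB estimates (17.56% and 18.22%). Sensitivity (about 0.63 for anxiety and 0.56 for depression) is noticeably lower than specificity (about 0.90 and 0.91), a gap that the overall error rate alone does not show.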